Introduction


The Project

Executive Summary

This project examines house sale data from Chennai, India. The goal is to predict the house sale price. For this analysis, we first examine the distribution of the variables and look for relationships. Next, we perform regression analysis predicting the house sale price for the neighborhoods. We then perform classification analysis predicting whether a home has a high or low sale price. Finally, we summarize our conclusions.

The best regression model was the tuned regression tree with an R-squared of 92%, and the best classification model was the classification tree with a sensitivity of 91%.

We find that the variables that decrease home sale price are:

  • Number of Bedrooms (N_BEDROOM)
  • Number of Bathrooms (N_BATHROOM)
  • The building type (BUILDTYPE)

The variables that increase home sale price are:

  • Interior square footage (INT_SQFT)
  • Number of Rooms (N_ROOM)
  • If there is Parking (PARK_FACIL)

The Problem Description

This project examines housing data from Chennai, India. We will perform both regression and classification analysis. The goal for the regression models is to predict the house sale price using the variables in the dataset. For this analysis, we first examine the distribution of the variables and look for relationships. Next, we perform regression analysis predicting the house sale price for the areas, using a variety of methods including linear regression, regression trees, and lasso regression. Second, we perform classification analysis predicting if an area has a high or low house sale value, using both logistic regression and classification trees. Finally, we end by summarizing our conclusions, examining which variables in the dataset help to predict the house sale value.

The Data

This data set has 7109 rows and 16 variables.

The Data

VARIABLES TO PREDICT WITH

  • AREA: The area a house is located in Chennai
  • INT_SQFT: square footage of house
  • DIST_MAINROAD: distance of house from main road
  • N_BEDROOM: number of bedrooms
  • N_BATHROOM: number of bathrooms
  • N_ROOM: number of rooms
  • SALE_COND: condition of house at time of sale
  • PARK_FACIL: is parking available or not (Yes or No)
  • BUILD_TYPE: purpose of house
  • UTILITY_AVAIL: facilities available at house
  • PAVED: If street accessing home is paved or not (Yes or No)

VARIABLES WE WANT TO PREDICT

  • SALES_PRICE: Sale price of house
  • SALES_PRICEhigh: Sale Price > $10M coded as Yes, lower coded as No

Data Exploration


View the Data Summaries

Now we can see the range of values for each variable. The AREA (house neighborhood) variable is truncated, but we can see the values in the bottom table.
     AREA              INT_SQFT    DIST_MAINROAD     N_BEDROOM    
 Length:7109        Min.   : 500   Min.   :  0.0   Min.   :1.000  
 Class :character   1st Qu.: 993   1st Qu.: 50.0   1st Qu.:1.000  
 Mode  :character   Median :1373   Median : 99.0   Median :1.000  
                    Mean   :1382   Mean   : 99.6   Mean   :1.637  
                    3rd Qu.:1744   3rd Qu.:148.0   3rd Qu.:2.000  
                    Max.   :2500   Max.   :200.0   Max.   :4.000  
                                                   NA's   :1      
   N_BATHROOM        N_ROOM            SALE_COND    PARK_FACIL
 Min.   :1.000   Min.   :2.000   AdjLand    :1433   No :3520  
 1st Qu.:1.000   1st Qu.:3.000   Partial    :1429   Noo:   2  
 Median :1.000   Median :4.000   Normal Sale:1423   Yes:3587  
 Mean   :1.213   Mean   :3.689   AbNormal   :1406             
 3rd Qu.:1.000   3rd Qu.:4.000   Family     :1403             
 Max.   :2.000   Max.   :6.000   Adj Land   :   6             
 NA's   :5                       (Other)    :   9             
      BUILDTYPE    UTILITY_AVAIL   SALES_PRICE       SALES_PRICEhigh PAVED     
 Comercial :   4   All Pub:   1   Min.   : 2156875   Yes:3813        Yes:2560  
 Commercial:2325   AllPub :1886   1st Qu.: 8272100   No :3296        No :4549  
 House     :2444   ELO    :1522   Median :10335050                             
 Other     :  26   NoSeWa :1871   Mean   :10894910                             
 Others    :2310   NoSewr :1829   3rd Qu.:12993900                             
                                  Max.   :23667340                             
                                                                               


Average Sales Price by PAVED (Street accessing home is paved)

PAVED n mean(SALES_PRICE)
Yes 2560 11057905
No 4549 10803182

Average Sales Price by AREA (neighborhood house is in)

AREA n mean(SALES_PRICE)
Chrompet 1681 10016662
Karapakkam 1363 7338627
Other 1307 11692237
KK Nagar 996 12695806
Velachery 979 11046654
Anna Nagar 783 15159519

Data Visualization

Response Variables relationships with predictors

  • We can see we have about half of the data as high sales price (>$10M). Looking at the potential predictors related to High Sales Price, we see the strongest relationships with square footage, number of rooms, and likely area.

  • We see the largest concentration of values around $8M-$12M. The data is also skewed to the right. We can see a noticeable increase in the number of values around $20M; this is actually due to truncation of the data.


High Sales Price

Sales Price


Sales Price vs Categorical Variables

Sales Price vs Continuous Variables

High Sales Price vs Continuous Variables

High Sales Price vs Categorical Variables

Regression Model


Linear Regression Full

Full Model Results

The Full Regression Model Coefficients

term estimate std.error statistic p.value
(Intercept) 10894686.43 14936.17 729.42 0.00
INT_SQFT 2366022.24 58201.87 40.65 0.00
DIST_MAINROAD 7346.79 14970.29 0.49 0.62
N_BEDROOM -249711.65 52896.85 -4.72 0.00
N_BATHROOM -339960.79 26556.61 -12.80 0.00
N_ROOM 557057.89 68201.69 8.17 0.00
AREA_Chrompet -272393.69 30945.19 -8.80 0.00
AREA_Karapakkam -1360098.63 30467.28 -44.64 0.00
AREA_KK.Nagar -890012.21 31070.79 -28.64 0.00
AREA_Other -206796.30 25524.31 -8.10 0.00
AREA_Velachery -1228373.72 28195.52 -43.57 0.00
SALE_COND_AbNormal 98520.41 225145.09 0.44 0.66
SALE_COND_Adj.Land -4387.63 22186.04 -0.20 0.84
SALE_COND_AdjLand 244137.75 226742.72 1.08 0.28
SALE_COND_Family 52116.42 224949.94 0.23 0.82
SALE_COND_Normal.Sale 113338.24 226122.50 0.50 0.62
SALE_COND_Partial -24567.10 226504.77 -0.11 0.91
SALE_COND_Partiall -9371.94 21648.29 -0.43 0.67
SALE_COND_PartiaLl -10957.94 16383.11 -0.67 0.50
PARK_FACIL_Noo -34091.75 14959.45 -2.28 0.02
PARK_FACIL_Yes 508950.92 14972.10 33.99 0.00
BUILDTYPE_Commercial 561615.45 296018.05 1.90 0.06
BUILDTYPE_House -1648676.04 299683.26 -5.50 0.00
BUILDTYPE_Other -159858.75 40879.41 -3.91 0.00
BUILDTYPE_Others -1317164.68 295520.95 -4.46 0.00
UTILITY_AVAIL_AllPub -134299.29 556792.63 -0.24 0.81
UTILITY_AVAIL_ELO -246187.92 517328.67 -0.48 0.63
UTILITY_AVAIL_NoSeWa -250714.90 555370.78 -0.45 0.65
UTILITY_AVAIL_NoSewr. -198612.21 551290.17 -0.36 0.72
PAVED_No 17726.73 15118.07 1.17 0.24

Analysis Summary

After examining this model, we determine that there are some predictors that are not important in predicting the house sale price, so a pruned version of the model is created by removing predictors that are not significant.

model RMSE MAE RSQ
Linear Model 1256103 991227.4 0.89

Linear Regression Final

For this analysis we will use a pruned linear regression model. We removed distance from the main road (DIST_MAINROAD), whether the street in front of the house is paved (PAVED), and the non-significant dummy levels of the sale condition (SALE_COND) and available utilities (UTILITY_AVAIL) variables.

Final Model Results

The Final Regression Model Coefficients

term estimate std.error statistic p.value
(Intercept) 10894782.38 14931.67 729.64 0.00
INT_SQFT 2365317.52 58156.36 40.67 0.00
N_BEDROOM -249340.96 52808.76 -4.72 0.00
N_BATHROOM -339918.37 26506.29 -12.82 0.00
N_ROOM 557466.25 68158.16 8.18 0.00
AREA_Chrompet -271559.80 30911.53 -8.79 0.00
AREA_Karapakkam -1359234.92 30425.24 -44.67 0.00
AREA_KK.Nagar -889680.75 30738.45 -28.94 0.00
AREA_Other -206081.61 25483.53 -8.09 0.00
AREA_Velachery -1228241.03 28138.68 -43.65 0.00
SALE_COND_AdjLand 138834.37 16371.54 8.48 0.00
SALE_COND_Family -52483.34 16361.92 -3.21 0.00
SALE_COND_Partial -129584.16 16374.81 -7.91 0.00
PARK_FACIL_Noo -33982.92 14952.94 -2.27 0.02
PARK_FACIL_Yes 508585.92 14957.27 34.00 0.00
BUILDTYPE_Commercial 570177.89 295848.56 1.93 0.05
BUILDTYPE_House -1639490.35 299507.45 -5.47 0.00
BUILDTYPE_Other -158304.43 40848.85 -3.88 0.00
BUILDTYPE_Others -1308923.11 295354.31 -4.43 0.00
UTILITY_AVAIL_AllPub 123298.13 16039.32 7.69 0.00
UTILITY_AVAIL_NoSewr. 55816.58 16064.13 3.47 0.00

Residual Assumptions Explorations

Compare actual (SALES_PRICE) vs predicted (y_hat) for pruned regression model

model RMSE MAE RSQ
Linear Model 1256103 991227.4 0.89
Linear Final Model 1256567 991979.4 0.89

Predicting Categorical sales price

Here is a look at a logistic regression model predicting high sales price.
term estimate std.error statistic p.value
(Intercept) 2.035 882.751 0.002 0.998
AREAChrompet 0.122 0.543 0.225 0.822
AREAKarapakkam 6.371 0.530 12.009 0.000
AREAKK Nagar 2.855 0.562 5.076 0.000
AREAOther 0.585 0.543 1.077 0.282
AREAVelachery 5.236 0.566 9.245 0.000
INT_SQFT -0.009 0.001 -17.074 0.000
DIST_MAINROAD 0.000 0.001 0.304 0.761
N_BEDROOM 1.690 0.303 5.581 0.000
N_BATHROOM 0.335 0.231 1.453 0.146
N_ROOM -1.710 0.270 -6.341 0.000
SALE_CONDAbNormal -0.576 1.448 -0.398 0.691
SALE_CONDAdj Land -11.531 284.310 -0.041 0.968
SALE_CONDAdjLand -1.372 1.448 -0.948 0.343
SALE_CONDFamily -0.618 1.448 -0.426 0.670
SALE_CONDNormal Sale -0.636 1.448 -0.439 0.660
SALE_CONDPartial -0.055 1.448 -0.038 0.969
SALE_CONDPartiall 2.509 49.184 0.051 0.959
SALE_CONDPartiaLl -9.384 882.745 -0.011 0.992
PARK_FACILNoo 13.630 610.826 0.022 0.982
PARK_FACILYes -1.952 0.114 -17.085 0.000
BUILDTYPECommercial -0.727 3.184 -0.228 0.819
BUILDTYPEHouse 6.812 3.188 2.136 0.033
BUILDTYPEOther 6.493 3.284 1.978 0.048
BUILDTYPEOthers 5.244 3.186 1.646 0.100
UTILITY_AVAILAllPub 8.015 882.743 0.009 0.993
UTILITY_AVAILELO 8.681 882.743 0.010 0.992
UTILITY_AVAILNoSeWa 8.648 882.743 0.010 0.992
UTILITY_AVAILNoSewr 8.385 882.743 0.009 0.992
PAVEDNo -0.255 0.105 -2.428 0.015
.metric .estimate
accuracy 0.917
specificity 0.925
sensitivity 0.909

Regression Tree Analysis


Regression Tree

We will predict the home sale price with all the variables.

View the regression tree.

We see it has 8 leaf nodes.

View the Variable Importance Plot

Compare actual (SALES_PRICE) vs predicted (y_hat)

Compare the Metrics

model RMSE MAE RSQ
Linear Model 1256103 991227.4 0.89
Linear Final Model 1256567 991979.4 0.89
Reg Tree Model 1506578 1213475.2 0.84

Tuned Regression Tree

Will tuning improve performance? We’ll use cross validation on the cost complexity and the tree depth.
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 6

Computational engine: rpart 

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
    cp = 1e-10, maxdepth = 6L)

View the regression tree.

We see it has 25 leaf nodes.

View the Variable Importance Plot

Compare actual (SALES_PRICE) vs predicted (y_hat) for tuned tree

Compare the metrics

model RMSE MAE RSQ
Linear Model 1256103 991227.4 0.89
Linear Final Model 1256567 991979.4 0.89
Reg Tree Model 1506578 1213475.2 0.84
Tuned Reg Tree Model 1085629 875359.8 0.92

Classification Analysis


Classification Models

When predicting the house sale price high/low variable (SALES_PRICEhigh), we have coded it so that Yes means the price is high (> $10,000,000) and No otherwise. For this analysis we perform a classification tree and a logistic regression. Both the classification tree and the full logistic regression have a sensitivity of around 91%. If I had to choose a single model, I would choose the classification tree, since it is easier to explain.


Classification Trees

We will use all the variables except SALES_PRICE, because SALES_PRICEhigh is created from it. For this model we will set the cost complexity to .001.
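The chunk that fits this tree is not shown above, so here is a minimal tidymodels sketch of the specification described (a sketch only; it assumes the same `df` prepared in the setup chunks, and the object names are illustrative):

```r
library(tidymodels)

# Classification tree with the cost complexity fixed at 0.001
tree_class_spec <- decision_tree(cost_complexity = 0.001) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Exclude SALES_PRICE, since SALES_PRICEhigh is derived from it
tree_class_fit <- tree_class_spec %>%
  fit(SALES_PRICEhigh ~ . - SALES_PRICE, data = df)
```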

Variable Importance

Here we view the variable importance measures. The higher the value, the more important.

View the Classification Tree Plot

We can see we have 5 leaf nodes.

Confusion matrix

          Truth
Prediction  Yes   No
       Yes 3468  238
       No   345 3058

View the Metrics

model Accuracy Sensitivity Specificity
Classification Tree Model 0.92 0.91 0.93

Checking the Cutoff

[1] "Best Cutoff 0.6071 Sensitivity 0.9095 Specificity 0.9278 AUC for Model 0.9671"
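One way to search for a cutoff like this (a sketch, assuming a predictions tibble such as `pred_log_fit` from `augment()` holding the fitted event probabilities in `.pred_Yes`) is to scan the ROC curve for the threshold that maximizes sensitivity + specificity (Youden's J):

```r
library(yardstick)
library(dplyr)

# ROC curve over all candidate thresholds
roc_df <- pred_log_fit %>%
  roc_curve(truth = SALES_PRICEhigh, .pred_Yes)

# Pick the threshold maximizing Youden's J = sensitivity + specificity - 1
best <- roc_df %>%
  mutate(youden = sensitivity + specificity - 1) %>%
  slice_max(youden, n = 1)
best
```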

Confusion matrix for Classification Cutoff 60%

          Truth
Prediction  Yes   No
       Yes 3468  238
       No   345 3058

Metrics for Classification Cutoff 60%

model Accuracy Sensitivity Specificity
Classification Tree Model 0.92 0.91 0.93
Classification Tree Model 60% Cutoff 0.92 0.91 0.93

Logistic Regression

For our final model, we will use logistic regression to explore High Sales Price. We can see the number of rooms, bathrooms, bedrooms, and interior square footage per dwelling are most important in the model.

Logistic Regression Equation

term estimate std.error statistic p.value
(Intercept) -13.78 882.75 -0.02 0.99
AREAChrompet 0.12 0.54 0.23 0.82
AREAKarapakkam 6.37 0.53 12.01 0.00
AREAKK Nagar 2.86 0.56 5.08 0.00
AREAOther 0.58 0.54 1.08 0.28
AREAVelachery 5.24 0.57 9.25 0.00
INT_SQFT -4.21 0.25 -17.07 0.00
DIST_MAINROAD 0.02 0.05 0.30 0.76
N_BEDROOM 1.36 0.24 5.58 0.00
N_BATHROOM 0.14 0.09 1.45 0.15
N_ROOM -1.74 0.27 -6.34 0.00
SALE_CONDAbNormal -0.58 1.45 -0.40 0.69
SALE_CONDAdj Land -11.53 284.31 -0.04 0.97
SALE_CONDAdjLand -1.37 1.45 -0.95 0.34
SALE_CONDFamily -0.62 1.45 -0.43 0.67
SALE_CONDNormal Sale -0.64 1.45 -0.44 0.66
SALE_CONDPartial -0.06 1.45 -0.04 0.97
SALE_CONDPartiall 2.51 49.18 0.05 0.96
SALE_CONDPartiaLl -9.38 882.74 -0.01 0.99
PARK_FACILNoo 13.63 610.83 0.02 0.98
PARK_FACILYes -1.95 0.11 -17.09 0.00
BUILDTYPECommercial -0.73 3.18 -0.23 0.82
BUILDTYPEHouse 6.81 3.19 2.14 0.03
BUILDTYPEOther 6.49 3.28 1.98 0.05
BUILDTYPEOthers 5.24 3.19 1.65 0.10
UTILITY_AVAILAllPub 8.01 882.74 0.01 0.99
UTILITY_AVAILELO 8.68 882.74 0.01 0.99
UTILITY_AVAILNoSeWa 8.65 882.74 0.01 0.99
UTILITY_AVAILNoSewr 8.39 882.74 0.01 0.99
PAVEDNo -0.26 0.11 -2.43 0.02

Pruned Logistic Regression Equation

term estimate std.error statistic p.value
(Intercept) -0.18 0.03 -5.98 0
INT_SQFT -0.50 0.09 -5.38 0
N_BEDROOM 1.60 0.10 15.40 0
N_BATHROOM 0.27 0.04 6.73 0
N_ROOM -2.77 0.15 -17.88 0
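The pruned fit itself is not shown in the source; a minimal sketch of how it could be specified with tidymodels (the small coefficient magnitudes suggest the predictors were normalized first, so preprocessing along the lines of `df_reg_norm` is assumed, and the object names are illustrative):

```r
library(tidymodels)

# Pruned logistic regression keeping only the four strongest predictors
log_pruned_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(SALES_PRICEhigh ~ INT_SQFT + N_BEDROOM + N_BATHROOM + N_ROOM,
      data = df)

tidy(log_pruned_fit)
```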

Examine the Confusion Matrix

          Truth
Prediction  Yes   No
       Yes 2882  758
       No   927 2536

Variable Importance

Here we view the variable importance measures. The higher the value, the more important.

Confusion matrix

          Truth
Prediction  Yes   No
       Yes 2882  758
       No   927 2536

View the Metrics

model Accuracy Sensitivity Specificity
Classification Tree Model 0.92 0.91 0.93
Classification Tree Model 60% Cutoff 0.92 0.91 0.93
Pruned Logistic Model 0.76 0.76 0.77

Checking the Cutoff

[1] "Best Cutoff 0.5224 Sensitivity 0.7467 Specificity 0.7884 AUC for Model 0.8474"

Confusion matrix for Logistic Cutoff 52%

          Truth
Prediction  Yes   No
       Yes 2847  701
       No   962 2593

Metrics for Logistic Cutoff 52%

model Accuracy Sensitivity Specificity
Classification Tree Model 0.92 0.91 0.93
Classification Tree Model 60% Cutoff 0.92 0.91 0.93
Pruned Logistic Model 0.76 0.76 0.77
Logistic Model 52% Cutoff 0.77 0.75 0.79

Conclusion

Summary

In conclusion, we can see that our predictors do help to predict the house sale price, whether as the high/low sale price (with the cutoff at $10,000,000) or the actual sale price.

Combining the results of both types of models and reporting only where they agree, we find that interior square footage (INT_SQFT), number of rooms (N_ROOM), and parking (PARK_FACIL) are associated with higher sale prices, while number of bedrooms (N_BEDROOM), number of bathrooms (N_BATHROOM), and building type (BUILDTYPE) are associated with lower sale prices.

Predicting Continuous Home Sale Price

In addition, if we compare the models that we examined for predicting the continuous home sale price, we see that the tuned regression tree has the larger R-squared.

  • Linear Regression RSQ: 0.89
  • Tuned Regression Tree RSQ: 0.92

Summary Metrics Table

model RMSE MAE RSQ
Linear Model 1256103 991227.4 0.89
Linear Final Model 1256567 991979.4 0.89
Reg Tree Model 1506578 1213475.2 0.84
Tuned Reg Tree Model 1085629 875359.8 0.92

Actual vs Predicted Plot

Predicting Categorical High Sales Price

And if we compare the models we examined for predicting the categorical response high sale price, we see that the classification tree has higher accuracy.

  • Classification Tree (cutoff .60) Accuracy .92 Sensitivity .91
  • Logistic Regression (cutoff .52) Accuracy .77 Sensitivity .75

Summary Metrics Table

model Accuracy Sensitivity Specificity
Classification Tree Model 0.92 0.91 0.93
Classification Tree Model 60% Cutoff 0.92 0.91 0.93
Pruned Logistic Model 0.76 0.76 0.77
Logistic Model 52% Cutoff 0.77 0.75 0.79

ROC Curves

---
title: "Project Dashboard"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    source_code: embed
    theme: yeti
---

```{r setup, include=FALSE,warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output

library(GGally) #v2.1.2
library(ggcorrplot) #v0.1.4
library(MASS) #v7.3-58.2
library(flexdashboard) #v0.6.0
library(plotly) #v4.10.1
library(crosstalk) #v1.2.0
library(knitr) #v1.42 kable()
library(tidymodels) 
  #library(parsnip) #v1.1.0 linear_reg(), set_engine(), set_mode(), fit(), predict()
  #library(yardstick) #v1.2.0 metrics(), roc_auc(), roc_curve(), metric_set(), conf_mat()
  #library(dplyr) #v1.1.2 %>%, select(), select_if(), filter(), mutate(), group_by(), 
    #summarize(), tibble()
  #library(ggplot2) #v3.4.2 ggplot()
  #library(broom) #v1.0.5 for tidy(), augment(), glance()
  #library(rsample) #v1.1.1 initial_split()
```

```{r load_data}
#Load the data
df <- read.csv("Chennai houseing sale.csv")
#creating categorical variable for sales price and dummy variable for paved street, making sure both are factors
#removing variables we will not be using for analysis
# removing QS variables because the data is masked
# removing registration fee & commission because those are dependent on sales price

df <- df %>% 
  dplyr::select(-QS_ROOMS,-QS_BATHROOM,-QS_BEDROOM,-QS_OVERALL,
                -COMMIS,-REG_FEE,-MZZONE,-PRT_ID,-DATE_BUILD,-DATE_SALE)

df <- df %>% 
  mutate(SALES_PRICEhigh =
           factor(if_else(SALES_PRICE>10000000,"Yes","No"),levels=c("Yes","No")),
         PAVED =factor(if_else(STREET=="Paved","Yes","No"),levels=c("Yes","No")),
         BUILDTYPE = factor(BUILDTYPE),
         UTILITY_AVAIL = factor(UTILITY_AVAIL),
         SALE_COND = factor(SALE_COND),
         PARK_FACIL = factor(PARK_FACIL),
         AREA = factor(AREA)) %>% 
  select(-STREET)


df <- df %>% 
  mutate(AREA=if_else(AREA=="Chrompet","Chrompet",
                      if_else(AREA =="Karapakkam","Karapakkam",
                              if_else(AREA=="KK Nagar","KK Nagar",
                                      if_else(AREA=="Velachery","Velachery",
                                              if_else(AREA=="Anna Nagar","Anna Nagar","Other"))))))
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=600}
-----------------------------------------------------------------------
### The Project
#### Executive Summary
This project examines house sale data from Chennai, India. The goal is to predict the house sale price. For this analysis, we first examine the distribution of the variables and look for relationships. Next, we perform regression analysis predicting the house sale price for the neighborhoods. We then perform classification analysis predicting whether a home has a high or low sale price. Finally, we summarize our conclusions.

The **best regression model was the tuned regression tree** with an *R-squared of 92%* and the **best classification model was the classification tree** with a *sensitivity of 91%*.

We find that the variables that decrease home sale price are:

-   Number of Bedrooms (N_BEDROOM)
-   Number of Bathrooms (N_BATHROOM)
-   The building type (BUILDTYPE)


The variables that increase home sale price are:

-   Interior square footage (INT_SQFT)
-   Number of Rooms (N_ROOM)
-   If there is Parking (PARK_FACIL)

#### The Problem Description
This project examines housing data from Chennai, India. We will perform both regression and classification analysis. The goal for the **regression models is to predict the house sale price** using the variables in the dataset. For this analysis, we first examine the distribution of the variables and look for relationships. Next, we perform regression analysis predicting the house sale price for the areas, using a variety of methods including **linear regression, regression trees, and lasso regression**. Second, we perform **classification analysis predicting if an area has a high or low house sale value**, using both **logistic regression and classification trees.** Finally, we end by summarizing our conclusions, examining which variables in the dataset help to predict the house sale value.

#### The Data
This data set has 7109 rows and 16 variables.

#### Data Sources
https://www.kaggle.com/datasets/kunwarakash/chennai-housing-sales-price


### The Data
VARIABLES TO PREDICT WITH


* **AREA**: The area a house is located in Chennai
* **INT_SQFT**: square footage of house 
* **DIST_MAINROAD**: distance of house from main road
* **N_BEDROOM**:  number of bedrooms 
* **N_BATHROOM**: number of bathrooms
* **N_ROOM**: number of rooms
* **SALE_COND**:  condition of house at time of sale
* **PARK_FACIL**:  is parking available or not (Yes or No)
* **BUILD_TYPE**: purpose of house
* **UTILITY_AVAIL**: facilities available at house
* **PAVED**: If street accessing home is paved or not (Yes or No)


VARIABLES WE WANT TO PREDICT

* **SALES_PRICE**:  Sale price of house
* **SALES_PRICEhigh**: Sale Price > $10M coded as Yes, lower coded as No

Data Exploration {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=200}
-------------------------------------

### Data Overview 
From this data we can see that our variables have a variety of different values based on their types. The homes range from being built in 1967 to 2002. All house sales are from 2005 - 2014. The top two areas are Chrompet and Karapakkam, which hold about 3/7 of the houses. The median square footage is 1373 and the max is 2500. We can see that there are many typos in the data. For example, `PARK_FACIL` (yes or no, if there is parking at the house) has both "No" and "Noo". Other inconsistencies in categorical variables can be seen across the data. In this data, remember `SALES_PRICEhigh` is just a categorical variable that is Yes if the sale price is high (> $10M).
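A possible cleanup (not performed in this analysis) would collapse these typo levels before modeling. A sketch using forcats, with the level names taken from the data summary:

```r
library(dplyr)
library(forcats)

# Merge obvious typo levels into their intended categories
df_clean <- df %>%
  mutate(PARK_FACIL = fct_collapse(PARK_FACIL, No = c("No", "Noo")),
         BUILDTYPE  = fct_collapse(BUILDTYPE,
                                   Commercial = c("Commercial", "Comercial"),
                                   Others     = c("Others", "Other")),
         SALE_COND  = fct_collapse(SALE_COND,
                                   Partial = c("Partial", "Partiall", "PartiaLl"),
                                   AdjLand = c("AdjLand", "Adj Land")))
```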

Column {data-width=450, data-height=600}
-----------------------------------------------------------------------
### View the Data Summaries
Now we can see the range of values for each variable. The `AREA` (house neighborhood) variable is truncated, but we can see the values in the bottom table.
```{r, cache=TRUE}
#View data
summary(df)
```

Column {data-width=150, data-height=300}
-----------------------------------------------------------------------
### Average Sales Price by `PAVED` (Street accessing home is paved)
```{r, cache=TRUE}
#Summary table for PAVED variable
df %>%
  group_by(PAVED) %>%
  summarize(n=n(), mean(SALES_PRICE)) %>%
  kable(digits=2)
```



### Average Sales Price by `AREA` (neighborhood house is in)
```{r, cache=TRUE}
df %>%
  group_by(AREA) %>%
  summarize(n=n(),mean(SALES_PRICE)) %>%
  arrange(-n) %>% 
  kable(digits=2)


```

Data Visualization {data-orientation=rows}
=======================================================================
### Response Variables relationships with predictors

* We can see we have about half of the data as high sales price (>$10M). Looking at the potential predictors related to High Sales Price, we see the strongest relationships with square footage, number of rooms, and likely area.

* We see the largest concentration of values around $8M-$12M. The data is also skewed to the right. We can see a decent increase in number of values around $20M. This is actually due to truncation of the data.

Row {data-height=550}
-----------------------------------------------------------------------
#### High Sales Price

```{r, cache=TRUE}
ggplot(df,aes(x=SALES_PRICEhigh)) + geom_bar()
```

#### Sales Price
```{r, cache=TRUE}
ggplot(df, aes(SALES_PRICE)) + geom_histogram(bins=20)
```


Row {.tabset data-height=450}
-----------------------------------------------------------------------
### Sales Price vs Categorical Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(df,SALES_PRICE,PAVED, BUILDTYPE,
                      UTILITY_AVAIL,SALE_COND,PARK_FACIL, AREA))
```

###  Sales Price vs Continuous Variables
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(df,SALES_PRICE,N_BEDROOM,N_BATHROOM,
                             DIST_MAINROAD,INT_SQFT,N_ROOM)))
```

### High Sales Price vs Continuous Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(df, SALES_PRICEhigh, N_BEDROOM, N_BATHROOM, DIST_MAINROAD, INT_SQFT, N_ROOM))
```

### High Sales Price vs Categorical Variables
```{r, cache=TRUE}
df %>% group_by(PAVED, SALES_PRICEhigh) %>%
  summarize(n=n()) %>%
  ggplot(aes(y=n, x=SALES_PRICEhigh,fill=PAVED)) +
      geom_bar(position="dodge", stat="identity") +
      geom_text(aes(label=n), position=position_dodge(width=0.9), vjust=-0.25) +
      ggtitle("High Sales Price vs Paved Street Access") +
      coord_flip() #makes horizontal
```

Regression Model {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
----------------------------------------------------------------------


### Predicting Continuous Sales Price

For the prediction of the continuous variable house sale price (SALES_PRICE), first we will use linear regression.


Row{data-height=2000, column-width = 700, .tabset .tabset-fade} 
-----------------------------------------------------------------------
### Linear Regression Full


#### Full Model Results

```{r, cache=TRUE}
reg_recipe <- recipe(SALES_PRICE ~ ., data = dplyr::select(df,-SALES_PRICEhigh)) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  prep()
df_reg_norm <- bake(reg_recipe, df)
#Define the model specification
reg_spec <- linear_reg() %>% ## Class of problem  
   set_engine("lm") %>% ## The particular function that we use  
   set_mode("regression") ## type of model
#Fit the model
reg1_fit <- reg_spec %>%  
   fit(SALES_PRICE ~ .,data = df_reg_norm)
#Capture the predictions and metrics
pred_reg1_fit <- augment(reg1_fit,df_reg_norm)
curr_metrics <- pred_reg1_fit %>%
  metrics(truth=SALES_PRICE,estimate=.pred)
results_reg <- tibble(model = "Linear Model",
                  RMSE = curr_metrics[[1,3]],
                  MAE = curr_metrics[[3,3]],
                  RSQ = curr_metrics[[2,3]]) 
```

#### The Full Regression Model Coefficients
```{r, cache=TRUE}
tidy(reg1_fit) %>%
  kable(digits=2)
```

#### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in predicting the house sale price, so a pruned version of the model is created by removing predictors that are not significant.

```{r, cache=TRUE}
results_reg %>%
  kable(digits = 2)
```
### Linear Regression Final 

For this analysis we will use a pruned linear regression model. We removed distance from the main road (DIST_MAINROAD), whether the street in front of the house is paved (PAVED), and the non-significant dummy levels of the sale condition (SALE_COND) and available utilities (UTILITY_AVAIL) variables.

#### Final Model Results
```{r, cache=TRUE}
reg2_fit <- reg_spec %>%
  fit(SALES_PRICE ~ . - DIST_MAINROAD - SALE_COND_AbNormal - UTILITY_AVAIL_NoSeWa - UTILITY_AVAIL_ELO - SALE_COND_Adj.Land
      - SALE_COND_Normal.Sale - SALE_COND_Partiall - SALE_COND_PartiaLl - PAVED_No, data = df_reg_norm)
#Capture the predictions and metrics
pred_reg2_fit <- augment(reg2_fit,df_reg_norm)
curr_metrics <- pred_reg2_fit %>%
  metrics(truth=SALES_PRICE,estimate=.pred)
results_new <- tibble(model = "Linear Final Model",
                  RMSE = curr_metrics[[1,3]],
                  MAE = curr_metrics[[3,3]],
                  RSQ = curr_metrics[[2,3]])
results_reg <- bind_rows(results_reg, results_new)
reg2_mae <- curr_metrics %>%
  filter(.metric=='mae') %>%
  pull(.estimate)

```

#### The Final Regression Model Coefficients
```{r, cache=TRUE}
tidy(reg2_fit) %>%
  kable(digits=2)
```
#### Residual Assumptions Explorations

```{r, cache=TRUE}
library(performance) #v0.10.0 check_model

reg2_fit %>%
  check_model(check=c('linearity','qq'))
```

#### Compare actual (SALES_PRICE) vs predicted (y_hat) for pruned regression model
```{r, cache=TRUE}
#Plot the Actual Versus Predicted Values
ggplotly(ggplot(data = pred_reg2_fit,
            aes(x = .pred, y = SALES_PRICE)) +
          geom_point(col = "#6e0000") +
            geom_abline(slope = 1) +
            ggtitle(paste("Pruned Regression with MAE",round(reg2_mae,2))))
```

```{r, cache=TRUE}
results_reg %>%
  kable(digits=2)
```


### Predicting Categorical sales price

Here is a look at a logistic regression model predicting high sales price.
```{r, cache=TRUE}
#Define the model specification
log_spec <- logistic_reg() %>%
             set_engine('glm') %>%
             set_mode('classification') 

#Fit the model
log_fit <- log_spec %>%
              fit(SALES_PRICEhigh ~ .-SALES_PRICE, data = df)

#Capture the predictions and metrics
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)

pred_log_fit <- augment(log_fit, df)
tidy(log_fit$fit) %>%
  kable(digits=3)
pred_log_fit %>%
    my_class_metrics(truth=SALES_PRICEhigh,estimate=.pred_class) %>%
    select(-.estimator) %>%
    kable(digits = 3, align = 'l')
```


# Regression Tree Analysis {data-navmenu="Regression Models"}

Column {.sidebar data-width=520}
----------------------------------------------------------------------
#### Analysis Summary

After examining these two trees we can see that **INT_SQFT** and **BUILDTYPE** are the most important variables for both trees. The next most important variables are **N_ROOM**, **N_BED**, and **AREA**. We can see that:

* If the **build type is commercial**, it **decreases sales price**.

* If the **area is Velachery**, it **increases sales price**.


Row{data-height=2000, column-width = 700, .tabset .tabset-fade} 
-----------------------------------------------------------------------

### Regression Tree
We will predict the home sale price with all the variables.
```{r, cache=TRUE}
#Define the model specification
tree_reg_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")
#Fit the model
tree1_fit <- tree_reg_spec %>%  
   fit(SALES_PRICE ~ .,data = df_reg_norm)
#Capture the predictions and metrics
pred_tree1_fit <- augment(tree1_fit,df_reg_norm)
curr_metrics <- pred_tree1_fit %>%
  metrics(truth=SALES_PRICE,estimate=.pred)
results_new <- tibble(model = "Reg Tree Model",
                  RMSE = curr_metrics[[1,3]],
                  MAE = curr_metrics[[3,3]],
                  RSQ = curr_metrics[[2,3]])
tree1_mae <- curr_metrics %>%
  filter(.metric=='mae') %>%
  pull(.estimate)
results_reg <- bind_rows(results_reg, results_new)

```

#### View the regression tree.
We see it has 8 leaf nodes.
```{r, cache=TRUE}

library(GGally) #v2.1.2
library(ggcorrplot) #v0.1.4
library(MASS) #v7.3-58.1 for Boston data
library(flexdashboard) #v0.6.0
library(rpart) #v 4.1.19 Partition package to create trees
library(rpart.plot) #v 3.1.1 creates nicer tree plots
library(vip) #v0.3.2 vip()
library(tidymodels) 
  #library(parsnip) #v1.1.0 linear_reg(), set_engine(), set_mode(), fit(), predict()
  #library(yardstick) #v1.2.0 metrics(), roc_auc(), roc_curve(), metric_set(), conf_mat()
  #library(dplyr) #v1.1.2 %>%, select(), select_if(), filter(), mutate(), group_by(), 
    #summarize(), tibble()
  #library(ggplot2) #v3.4.2 ggplot()
  #library(broom) #v1.0.5 for tidy(), augment(), glance()
  #library(rsample) #v1.1.1 initial_split()
library(plotly) #v4.10.1
library(performance) #v0.10.0 check_model
library(see) #v0.7.3 for check_model plots from performance
library(patchwork) #v1.1.1 for check_model plots from performance
library(knitr) #v1.41 kable()

rpart.plot(tree1_fit$fit, roundint=FALSE)
```

#### View the Variable Importance Plot
```{r, cache=TRUE}
vip(tree1_fit)
```

#### Compare actual (SALES_PRICE) vs predicted (y_hat)
```{r, cache=TRUE}
#Plot the Actual Versus Predicted Values
ggplotly(ggplot(data = pred_tree1_fit,
            aes(x = .pred, y = SALES_PRICE)) +
  geom_point(col = "#6e0000") +
  geom_abline(slope = 1) +
  ggtitle(paste("Regression Tree with MAE",round(tree1_mae,2))))
```

#### Compare the Metrics
```{r, cache=TRUE}
results_reg %>%
  kable(digits=2)
```

### Tuned Regression Tree
Will tuning improve performance? We'll use cross validation on the cost complexity and the tree depth.
```{r, cache=TRUE}
#Define the model specification
tree_tune_spec <- decision_tree(cost_complexity = tune(),
                             tree_depth = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")
df_folds <- vfold_cv(df_reg_norm)
tree_grid <- dials::grid_regular(cost_complexity(),
                                   tree_depth(range = c(2, 6)),
                                   levels = 5)
tree2_wf <- workflow() %>%
  add_model(tree_tune_spec) %>%
  add_formula(SALES_PRICE ~ .)
#Tune on the grid of values
tree2_rs <- tree2_wf %>% 
  tune_grid(resamples = df_folds,
            grid = tree_grid)
#finalize the workflow
final_tree_wf <- 
  tree2_wf %>% 
  finalize_workflow(select_best(tree2_rs, metric='rmse'))
final_tree_fit <- 
  final_tree_wf %>%
  fit(data = df_reg_norm) %>%
  extract_fit_parsnip() 
#Capture the predictions and metrics
pred_tree2_fit <- augment(final_tree_fit,df_reg_norm)
curr_metrics <- pred_tree2_fit %>%
  metrics(truth=SALES_PRICE,estimate=.pred)
results_new <- tibble(model = "Tuned Reg Tree Model",
                  RMSE = curr_metrics[[1,3]],
                  MAE = curr_metrics[[3,3]],
                  RSQ = curr_metrics[[2,3]])
tree2_mae <- curr_metrics %>%
  filter(.metric=='mae') %>%
  pull(.estimate)
results_reg <- bind_rows(results_reg, results_new)

```

```{r, cache=TRUE}
final_tree_fit$spec
```
#### View the regression tree.
We see it has 25 leaf nodes.
```{r, cache=TRUE}
rpart.plot(final_tree_fit$fit, roundint=FALSE)
```

#### View the Variable Importance Plot
```{r, cache=TRUE}
vip(final_tree_fit)
```

#### Compare actual (SALES_PRICE) vs predicted (y_hat) for tuned tree
```{r, cache=TRUE}
ggplotly(ggplot(data = pred_tree2_fit,
            aes(x = .pred, y = SALES_PRICE)) +
        geom_point(col = "#6e0000") +
        geom_abline(slope = 1) +
        ggtitle(paste("Regression Tuned Tree with MAE",round(tree2_mae,2))))
```


#### Compare the metrics
```{r, cache=TRUE}
results_reg %>%
  kable(digits=2)
```
Classification Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Classification Models
When predicting the high/low house sale price variable (SALES_PRICEhigh), we have coded it so that Yes means the price is high (> 10,000,000) and No otherwise. For this analysis we will fit a classification tree and a logistic regression. **Both models have a sensitivity of around 91%**. If I had to choose a single model, I would choose the classification tree since it is easier to explain.
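The cutoff rule above can be sketched in base R. This is a minimal, hypothetical illustration with toy values, not the code that actually prepares `df`:

```r
# Hypothetical sketch of the Yes/No coding rule (toy values, not the Chennai data).
# Homes above the 10,000,000 cutoff are labeled "Yes", all others "No".
SALES_PRICE <- c(5e6, 1.2e7, 9e6, 1.5e7)
SALES_PRICEhigh <- factor(ifelse(SALES_PRICE > 1e7, "Yes", "No"),
                          levels = c("Yes", "No"))
table(SALES_PRICEhigh)
```

Putting "Yes" first in the factor levels matters later: yardstick treats the first level as the positive class when computing sensitivity and specificity.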

Row {data-height=2500 .tabset .tabset-fade}
-------------------------------------

### Classification Trees 
We will use all the variables except SALES_PRICE, since SALES_PRICEhigh is derived from it. For this model we will set the cost complexity to 0.001.

```{r, cache=TRUE}
class_recipe <- recipe(SALES_PRICEhigh ~ ., data = dplyr::select(df,-SALES_PRICE)) %>%
  step_normalize(all_numeric()) %>%
  prep()
df_class_norm <- bake(class_recipe, df)

tree_class_spec <- decision_tree(cost_complexity=.001) %>%
                    set_engine("rpart") %>%
                    set_mode("classification")
#Fit the model
class_tree1_fit <- tree_class_spec %>%  
   fit(SALES_PRICEhigh ~ .,data = df_class_norm)
#Capture the predictions and metrics
pred_class_tree1_fit <- augment(class_tree1_fit,df_class_norm)
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)
curr_metrics <- pred_class_tree1_fit %>%
  my_class_metrics(truth=SALES_PRICEhigh,estimate=.pred_class)
results_cls <- tibble(model = "Classification Tree Model",
                  Accuracy = curr_metrics[[1,3]],
                  Sensitivity = curr_metrics[[3,3]],
                  Specificity = curr_metrics[[2,3]])
class_tree1_sens <- curr_metrics %>%
  filter(.metric=='sens') %>%
  pull(.estimate)

```

#### Variable Importance
Here we view the variable importance measures. The higher the value, the more important.
```{r, cache=TRUE}
library(vip) #v0.3.2 vip()

vip(class_tree1_fit)
```

#### View the Classification Tree Plot
We can see we have 5 leaf nodes.
```{r, cache=TRUE}

rpart.plot(class_tree1_fit$fit, type=1, extra = 102, roundint=FALSE)
```

#### Confusion matrix
```{r, cache=TRUE}
pred_class_tree1_fit %>%
  conf_mat(truth=SALES_PRICEhigh,estimate=.pred_class)
```
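The sensitivity reported below can be checked by hand from a confusion matrix: it is TP / (TP + FN), with Yes as the positive class. A base-R sketch on toy vectors (not the model's actual predictions):

```r
# Toy truth/prediction vectors; "Yes" is listed first, matching how
# SALES_PRICEhigh is coded, so "Yes" is the positive class.
truth <- factor(c("Yes", "Yes", "Yes", "No", "No"), levels = c("Yes", "No"))
pred  <- factor(c("Yes", "Yes", "No",  "No", "No"), levels = c("Yes", "No"))

cm <- table(Predicted = pred, Truth = truth)  # 2x2 confusion matrix
TP <- cm["Yes", "Yes"]                        # truly high, predicted high
FN <- cm["No",  "Yes"]                        # truly high, missed
sens <- TP / (TP + FN)                        # here 2/3
```

This is the same quantity `my_class_metrics()` reports as sensitivity for the full model.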

#### View the Metrics
```{r, cache=TRUE}
results_cls %>%
  kable(digits = 2, align = 'l')
```
#### Checking the Cutoff
```{r, cache=TRUE}
#Find Best Threshold cutoff
ROC_threshold <- function(pred_data,truth,probs) {
  #This function finds the cutoff with the max sum of sensitivity and specificity
  #Created tidy version of:
  #http://scipp.ucsc.edu/~pablo/pulsarness/Step_02_ROC_and_Table_function.html
  #The inputs are the prediction table (from augment()) and the columns for the
  #truth and estimate values. The columns need to be strings (i.e., 'sales')
 
  roc_curve_tbl <- pred_data %>% 
                    roc_curve(truth = {{truth}}, {{probs}}) 
  auc = pred_data %>%
              roc_auc(truth = {{truth}}, {{probs}}) %>%
              pull(.estimate)
  best_row = which.max(roc_curve_tbl$specificity + roc_curve_tbl$sensitivity)
  print(paste("Best Cutoff", round(roc_curve_tbl[[best_row,'.threshold']],4),
              "Sensitivity", round(roc_curve_tbl[[best_row,'sensitivity']],4),
              "Specificity", round(roc_curve_tbl[[best_row,'specificity']],4),
              "AUC for Model", round(auc,4)))
}
ROC_threshold(pred_class_tree1_fit,'SALES_PRICEhigh', '.pred_Yes')

#Adding a new cutoff prediction column
pred_class_tree1_fit <- pred_class_tree1_fit %>%
                    mutate(pred_Yes_60 = factor(ifelse(.pred_Yes > .60,"Yes","No"),
                                              levels=c("Yes","No")))
```

#### Confusion matrix for Classification Cutoff 60%
```{r, cache=TRUE}
pred_class_tree1_fit %>%
  conf_mat(truth=SALES_PRICEhigh,estimate=pred_Yes_60)
```

#### Metrics for Classification Cutoff 60%
```{r, cache=TRUE}
curr_metrics <- pred_class_tree1_fit %>%
  my_class_metrics(truth=SALES_PRICEhigh,estimate=pred_Yes_60)
results_new <- tibble(model = "Classification Tree Model 60% Cutoff",
                  Accuracy = curr_metrics[[1,3]],
                  Sensitivity = curr_metrics[[3,3]],
                  Specificity = curr_metrics[[2,3]])
results_cls <- bind_rows(results_cls, results_new)
results_cls %>%
  kable(digits=2, align = 'l')
```



### Logistic Regression
For our final model, we will use logistic regression to explore the high sales price. We can see the number of rooms, bathrooms, and bedrooms and the interior square footage per dwelling are most important in the model.

#### Logistic Regression Equation
```{r, cache=TRUE}
#Define the model specification
log_spec <- logistic_reg() %>%
             set_engine('glm') %>%
             set_mode('classification') 

#Fit the model
log_fit <- log_spec %>%
              fit(SALES_PRICEhigh ~ ., data = df_class_norm)
tidy(log_fit$fit) %>%
  kable(digits=2)
```

#### Pruned Logistic Regression Equation
```{r, cache=TRUE}
#Fit the model
log2_fit <- log_spec %>%
              fit(SALES_PRICEhigh ~ .-AREA-DIST_MAINROAD-SALE_COND-UTILITY_AVAIL-PARK_FACIL-BUILDTYPE-PAVED, data = df_class_norm)
tidy(log2_fit$fit) %>%
  kable(digits=2)

#Capture the predictions and metrics
pred_log2_fit <- augment(log2_fit,df_class_norm)
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)
curr_metrics <- pred_log2_fit %>%
  my_class_metrics(truth=SALES_PRICEhigh,estimate=.pred_class)
results_new <- tibble(model = "Pruned Logistic Model",
                  Accuracy = curr_metrics[[1,3]],
                  Sensitivity = curr_metrics[[3,3]],
                  Specificity = curr_metrics[[2,3]])
results_cls <- bind_rows(results_cls, results_new)
log2_sens <- curr_metrics %>%
  filter(.metric=='sens') %>%
  pull(.estimate)
```

#### Examine the Confusion Matrix
```{r, cache=TRUE}
pred_log2_fit %>%
  conf_mat(truth=SALES_PRICEhigh,estimate=.pred_class)
```
#### Variable Importance
Here we view the variable importance measures. The higher the value, the more important.
```{r, cache=TRUE}
vip(log2_fit)
```


#### View the Metrics
```{r, cache=TRUE}
results_cls %>%
  kable(digits = 2, align = 'l')
```
#### Checking the Cutoff
```{r, cache=TRUE}
ROC_threshold(pred_log2_fit, 'SALES_PRICEhigh', '.pred_Yes')

#Adding a new cutoff prediction column
pred_log2_fit <- pred_log2_fit %>%
                    mutate(pred_Yes_52 = factor(ifelse(.pred_Yes > .52,"Yes","No"),
                                              levels=c("Yes","No")))
```

#### Confusion matrix for Logistic Cutoff 52%
```{r, cache=TRUE}
pred_log2_fit %>%
  conf_mat(truth=SALES_PRICEhigh,estimate=pred_Yes_52)
```

#### Metrics for Logistic Cutoff 52%
```{r, cache=TRUE}
curr_metrics <- pred_log2_fit %>%
  my_class_metrics(truth=SALES_PRICEhigh,estimate=pred_Yes_52)
results_new <- tibble(model = "Logistic Model 52% Cutoff",
                  Accuracy = curr_metrics[[1,3]],
                  Sensitivity = curr_metrics[[3,3]],
                  Specificity = curr_metrics[[2,3]])
results_cls <- bind_rows(results_cls, results_new)
results_cls %>%
  kable(digits = 2, align = 'l')
```

Conclusion
=======================================================================

### Summary
In conclusion, we can see that our predictors do help to predict the house sale price, both the high/low sale price (with the cutoff at $10,000,000) and the actual sale price.

Combining the results of both types of models and only reporting where agreement was found, we can see that:

* **INT_SQFT**, **N_ROOM**, and **PARK_FACIL** are associated with a **higher** sale price.
* **N_BEDROOM**, **N_BATHROOM**, and a commercial **BUILDTYPE** are associated with a **lower** sale price.


### Predicting Continuous Home Sale Price

In addition, if we compare the models that we examined for predicting the continuous home sale price, we see that the regression tree has the larger r-squared.

* Linear Regression RSQ: 0.89
* Tuned Regression Tree RSQ: 0.92

#### Summary Metrics Table
```{r, cache=TRUE}
results_reg %>%
  kable(digits=2, align = 'l')
```

#### Actual vs Predicted Plot
```{r, cache=TRUE}
df_act_pred <- bind_rows(
            pred_reg1_fit %>% mutate(model = 'Linear Model'),
            pred_reg2_fit %>% mutate(model = 'Linear Final Model'),
            pred_tree1_fit %>% mutate(model = 'Reg Tree Model'),
            pred_tree2_fit %>% mutate(model = 'Tuned Reg Tree Model')
)

ggplotly(ggplot(df_act_pred, aes(y = .pred, x = SALES_PRICE, color=model)) + 
  geom_point() +
    geom_abline(col = "gold") + 
    ggtitle("Predicted vs Actual Sales Price") )
```

### Predicting Categorical High Sale Price
And if we compare the models we examined for predicting the categorical response high sale price, we see that the classification tree has higher accuracy.

* Classification Tree (cutoff .60) Accuracy .92  Sensitivity .91
* Logistic Regression (cutoff .52) Accuracy .77 Sensitivity .75

#### Summary Metrics Table
```{r, cache=TRUE}
results_cls %>%
  kable(digits=2, align = 'l')
```

#### ROC Curves
```{r, cache=TRUE}
#Capture the auc
log_auc <- pred_log2_fit %>%
  roc_auc(truth=SALES_PRICEhigh, .pred_Yes) %>%
  pull(.estimate)
tree_auc <- pred_class_tree1_fit %>%
  roc_auc(truth=SALES_PRICEhigh, .pred_Yes) %>%
  pull(.estimate)

#Capture the thresholds and sens/spec
df_roc <- bind_rows(pred_log2_fit %>% 
                        roc_curve(truth = SALES_PRICEhigh, .pred_Yes) %>% 
                        mutate(model = paste('Logistic', round(log_auc,2))),
                    pred_class_tree1_fit %>% 
                        roc_curve(truth = SALES_PRICEhigh, .pred_Yes) %>% 
                        mutate(model = paste('Class Tree', round(tree_auc,2))),
)

#Create the ROC Curve(s)
ggplotly(ggplot(df_roc,
        aes(x = 1 - specificity, y = sensitivity,
            group = model, col = model)) +
        geom_path() +
        geom_abline(lty = 3)  +
        scale_color_brewer(palette = "Dark2") +
        theme(legend.position = "top"))

```